Telco Customer Churn Data Exploratory Analysis

Author : Indra Yanto

0. Setting the Work Environment

1. Data Profiling and Summarizing

The customer base dataset used in this work is made available by IBM and downloaded from Kaggle. It is related to an anonymous telecom company and contains 7043 customers data with 21 attributes where each row represents a customer and each column contains customer’s attributes. Link for the dataset : https://www.kaggle.com/blastchar/telco-customer-churn

1.1 Describing each feature

Let's describe each feature shortly:

1.2 Summarizing the Datasets

1.3 Business Objectives

The analysis is carried out to answer this problem :

2. Data Cleansing and Preparation

2.1 Checking for the data types and missing values

Based on above dataframe, it appears that :

2.2 Converting the object values on TotalCharges column

After sorting the TotalCharges, we can clearly see that the roots of error when converting the datatypes are ' ' characters. By looking at the tenure columns, we find out that these anomalies may appear because the tenure values is 0 and one month period hasn't been completed yet or maybe there are free trials. Since this anomaly only appears in 11 rows (only 0.1% of total rows), these rows will be removed.

The TotalCharges column is already detected as float64 variable

2.3 Checking for outliers

3. Data Exploratory Analysis

3.1 Univariate Analysis

First, let's create a function to calculate count and percentage of each category within one categorical variable

It can be seen that there is an imbalance for the target variable since it contains 5163 rows of No entries (73.42\%) and 1869 rows of Yes entries (26.58\%), indicates that the corresponding company has 26.57\% churn rate within the last month

Some function will be made to create the countplot more easily

One can find that within the sample, the amount of each gender is approximately equal. The same applies to the Partner variable as the number of customers with or without a partner is also roughly similar. The distribution of imbalances can be seen by other features that around 16\% of customers are considered elderly and 30\% of customers have dependents.

Contract, PaperlessBilling, and Payment method give informations about customer account and transaction method. From above figure, it is obvious that most customers prefer a month-to-month contract rather than a yearly contract. Moreover, for the transaction method, most customers opting for paperless billing and using electronic check.

There are 9 variables that provide information about the services the customer signs up for. The bar charts shows that approximately 90\% of customers subscribe to telephone service and 42\% also subscribe to multi-line service as well. It also shows that the majority of customers also subscribe to an Internet service with 44\% of customers prefer the type of fiber optic service and 34\% choose DSL. It is apparent that most customers who use the Internet service prefer not to use some additional Internet services such as OnlineSecurity, OnlineBackup, DeviceProtection and TechSupport. However, the same thing does not apply to the streaming service since the number of customers using or not using the service is similar. This also means that the streaming service is the most popular among the additional internet services.

For numeric variables, univariate analysis will be conducted using the histogram. In the preceding section, it is already known that there are no outliers in the dataset. However, the distribution of each variable is not yet known and needs to be visualized. The tenure histogram shows that the distribution of its values is considered bi-modal (2 peaks), indicates that the dataset is concentrated around 2 clusters. One of them is customers with less than 10 months tenure and the last is loyal customers, signed up for 65 months or more. From another point of view, one can also see that the company has indeed been able to attract many customers in the last 10 months. The bimodal distribution is also visible in the MonthlyCharges distribution where the data set is focused around two groups, i.e customers who purchased the basic service (cheapest price) with the amount of 20 dollars only and other is the customer who purchased multi-services with the amount of approximately 80 dollars per month. Apart from its positively skewed form, it is difficult to interpret any useful information from the TotalCharges histogram since TotalCharges is roughly computed from MonthlyCharges multiplied by tenure. Since univariate analysis does not provide enough information to answer the problems, bivariate analysis will be conducted in the next section with great focus to the target variable, Churn.

3.2 Bivariate Analysis

In this section, the analysis will begin by calculating the correlation of each variable to another with the Spearman method. Before calculating the correlation matrix, the categorical variable within the dataset first need to be encoded. As shown in the Spearman Correlation Heatmap,tenure and MonthlyCharges are highly correlated to TotalCharges, which is reasonable since TotalCharges is roughly similar to MonthlyCharges multiplied by tenure. Other interesting result is encoded variable StreamingTV_Yes has 0.53 correlation to StreamingMovies_Yes, which indicates the customer who have TV streaming service is likely to have movies streaming service also. One also can see from the heatmap that services with strongest correlation to MonthlyCharges are fiber optic with 0.8, no internet service with -0.71 and streaming service with 0.64, describing those services do have strong influences to the MonthlyCharges value. The heatmap also shows that Churn does not have quite strong correlation to any variable. The strongest correlations for the target variable Churn are -0.37 by tenure, 0.31 by fiber optic internet service, and 0.3 by electronic check.

From the stacked bar charts of customer demographic attributes, customer account attributes, and customer service attributes one can notice that :

The bivariate analysis will also be conducted for the numerical variables such as tenure, MonthlyCharges, and TotalCharges. The tenure histogram shows that most customers who churned within the last month only subscribed for less than 5 months. This is definitely a loss as the company has actually managed to attract a lot of new customers in the last few months. From MonthlyCharges histogram, it appears that the majority of lost customers are charged by approximately 70-110 dollars per month. To urge more detail information, these 2 variables will be binned by dividing their values into 6 quantiles with roughly similar amount.

Tenure binning chart appears an instinctive result as the churn likelihood gets smaller as the membership time gets longer. It also tells that more than 50\% of customers who only subscribed less than 4 months prefer to churn (mostly even churn in their first month). From MonthlyCharges binning, it can be seen that the premium customers who are billed more than 70 dollars per month are more likely to churn compared to other customers with less bill. From the business perspective, it is surely more beneficial for the company to have a great focus improving on the premium services since those services have more lost customers and donate more month-to-month income for the company. Another interesting result is the customer who only subscribes for basic service (cheapest monthly charges) seems quite satistied with the service quality and less likely to churn.

3.3 Summary and Justification

From bivariate analysis, one can figure several categories that may represent who the lost customer is. However, the question of which factors is the main culprit remains unanswered. Consider that it is impossible for customers who use electronic check for the payment method decide to churn without any reasons. In this section, that question will be tried to solve.

Above jitter plot depicts why many customers who pay more than 70 dollars per month choose to churn. The reason is due to the internet service they bought. As shown in that plot, it is evident that customers who churn with MonthlyCharges ranging from 70 to 120 dollars are usually customers who use FiberOptic service. From the previous section, one of the most important insights is Fiber Optic service indeed has higher rate of churn. That's also why SeniorCitizen, ElectronicCheck, and PaperlessBilling have higher churn rates than others. The boxplots show that MonthlyCharges distribution for SeniorCitizen Yes, PaperlessBilling Yes and Electronic check are clustered around the churn-critical area, i.e 60-90 dollars per month.

Another key takeaways from the preceding section is that many customers who join up for the streaming service choose to churn (more than other additional services such as OnlineSecurity, OnlineBackup, etc). Because the correlation matrix indicates that customers who use TV streaming services are more likely to use movies streaming services as well, these two variables will be combined to form a new variable, Streaming.

Above bar chart shows that the vast majority of customers who use internet also sign up for the streaming service. It can be seen that roughly 70\% of Fiber optic users subscribe for the streaming service. However, two important insights obtained from the previous section are both of these variables contribute to the churn decision. So in which variable does the main problem exist? Does it exist in the streaming service or the kind of the internet service itself, i.e Fiber optic?

These 2 figures appear as an answer to those questions. From these figures it is evident that the customers who do not subscribe for any Streaming services actually have slightly higher probability to churn in both DSL and Fiber optic services. Moreover, the Churn probability for Fiber optic users almost two times higher than DSL users. Both of these findings indicate that the major issue is the Fiber Optic service itself.

In addition to InternetService, tenure is another significant factor in this dataset. Above figure shows that practically almost every value of MonthlyCharges has customers who churn in their first 6 months. This suggests that the majority of new telecom customers may have negative beginning experiences. Another finding is as tenure increases the contract also increases from monthly becomes yearly. That's why in the small tenure period, mostly the contracts are monthly and customers with that contract are more likely to churn than those who have yearly contract. This finding is actually quite reasonable because monthly contract customers are frequently new and still not satisfied enough with the service quality and unsure whether they will remain loyal to the company. That's why one also can see that as the period of contract increases the churn probability decreases.

4. Conclusion

From all the results above it can be concluded that :